Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?

Hanin, Boris

Neural Information Processing Systems

We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant beta, given by the sum of the reciprocals of the hidden layer widths. When beta is large, the gradients computed by N at initialization vary wildly. Our approach complements the mean field theory analysis of random networks. From this point of view, we rigorously compute finite width corrections to the statistics of gradients at the edge of chaos.
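The architecture constant is simple to illustrate: for hidden layer widths n_1, ..., n_d, beta = 1/n_1 + ... + 1/n_d. Below is a minimal Python sketch (not the authors' code) that estimates how strongly squared Jacobian entries fluctuate at initialization for a wide and a narrow ReLU net of equal depth; the He-style weight scaling (variance 2/fan-in) and all function names here are illustrative assumptions.

```python
import numpy as np

def beta(widths):
    """Architecture constant: sum of reciprocals of hidden layer widths."""
    return sum(1.0 / n for n in widths)

def jacobian_entry_sq(widths, n_in=10, rng=None):
    """Squared (0, 0) entry of the input-output Jacobian of a random ReLU net.

    Assumes He-style initialization (weight variance 2 / fan_in), a common
    choice for ReLU networks; the paper's exact scaling may differ.
    """
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(n_in)
    J = np.eye(n_in)          # running input-output Jacobian
    fan_in = n_in
    for n in widths:
        W = rng.standard_normal((n, fan_in)) * np.sqrt(2.0 / fan_in)
        pre = W @ x
        D = np.diag((pre > 0).astype(float))  # ReLU derivative mask
        J = D @ W @ J                          # chain rule through this layer
        x = np.maximum(pre, 0.0)
        fan_in = n
    return J[0, 0] ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for widths in ([100] * 10, [10] * 10):  # same depth; beta = 0.1 vs 1.0
        samples = np.array([jacobian_entry_sq(widths, rng=rng)
                            for _ in range(2000)])
        print(f"width={widths[0]} x depth={len(widths)}  "
              f"beta={beta(widths):.2f}  "
              f"var/mean^2 of J_00^2: {samples.var() / samples.mean() ** 2:.2f}")
```

With these widths, beta is 0.1 for the wide net and 1.0 for the narrow one, so the narrow net's squared Jacobian entries should show markedly larger relative variance across initializations, consistent with the exponential-in-beta behavior the abstract describes.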



Reviews: Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?

Neural Information Processing Systems

The paper gives a theoretical study of the exploding and vanishing gradient problem (EVGP) in deep fully connected ReLU networks. As criteria for whether the EVGP has been avoided, the paper proposes two conditions: annealed EVGP and quenched EVGP. It then shows that both criteria are met when the sum of the reciprocals of the layer widths is small (so ideally every layer should be wide). To confirm this empirically, the paper uses an experiment from a concurrent work. Comments: To motivate the formal study of the EVGP in deep networks, the authors cite papers that suggest examining the distribution of singular values of the input-output Jacobian.

